There are 1,599 observations of red wines with 12 recorded features for each observation. Some of the features are related to each other (e.g., those related to acidity). Quality is the only categorical feature.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity    volatile.acidity   citric.acid      residual.sugar
## Min.   : 4.60    Min.   :0.1200     Min.   :0.000    Min.   : 0.900
## 1st Qu.: 7.10    1st Qu.:0.3900     1st Qu.:0.090    1st Qu.: 1.900
## Median : 7.90    Median :0.5200     Median :0.260    Median : 2.200
## Mean   : 8.32    Mean   :0.5278     Mean   :0.271    Mean   : 2.539
## 3rd Qu.: 9.20    3rd Qu.:0.6400     3rd Qu.:0.420    3rd Qu.: 2.600
## Max.   :15.90    Max.   :1.5800     Max.   :1.000    Max.   :15.500
## chlorides          free.sulfur.dioxide    total.sulfur.dioxide
## Min.   :0.01200    Min.   : 1.00          Min.   :  6.00
## 1st Qu.:0.07000    1st Qu.: 7.00          1st Qu.: 22.00
## Median :0.07900    Median :14.00          Median : 38.00
## Mean   :0.08747    Mean   :15.87          Mean   : 46.47
## 3rd Qu.:0.09000    3rd Qu.:21.00          3rd Qu.: 62.00
## Max.   :0.61100    Max.   :72.00          Max.   :289.00
## density           pH               sulphates         alcohol
## Min.   :0.9901    Min.   :2.740    Min.   :0.3300    Min.   : 8.40
## 1st Qu.:0.9956    1st Qu.:3.210    1st Qu.:0.5500    1st Qu.: 9.50
## Median :0.9968    Median :3.310    Median :0.6200    Median :10.20
## Mean   :0.9967    Mean   :3.311    Mean   :0.6581    Mean   :10.42
## 3rd Qu.:0.9978    3rd Qu.:3.400    3rd Qu.:0.7300    3rd Qu.:11.10
## Max.   :1.0037    Max.   :4.010    Max.   :2.0000    Max.   :14.90
## quality
## Min.   :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean   :5.636
## 3rd Qu.:6.000
## Max.   :8.000
## [1] 0
At this point it was unclear which features would be useful.
There didn’t seem to be much of a need to create new variables.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
As can be seen above, the vast majority of wines are scored 5 or 6.
3’s and 4’s
## [1] 0.03939962
5’s and 6’s
## [1] 0.8248906
7’s and 8’s
## [1] 0.1357098
Scores of 5 and 6 account for over 82% of all scores! This suggests that the most useful information might be found by examining the lowest and highest scorers, but we’ll save that for later.
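The three proportions above can be recomputed directly from the table of quality counts. The sketch below is not the original analysis code; `quality_counts` and `score_share` are names introduced here for illustration, and the counts are simply copied from the table.

```r
# Counts copied from the table of quality scores above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of all wines whose score falls in a given set of scores.
# (score_share is a hypothetical helper, not part of the original report.)
score_share <- function(counts, scores) {
  sum(counts[as.character(scores)]) / sum(counts)
}

low  <- score_share(quality_counts, 3:4)
mid  <- score_share(quality_counts, 5:6)
high <- score_share(quality_counts, 7:8)

print(c(low = low, mid = mid, high = high))
```

The three shares add up to 1, which is a quick check that no score was left out of the grouping.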
None of the features seemed unusual enough to explore further and, no, I didn’t notice any need for tidying/adjusting the form of the data at this point.
Reason for this plot: I wanted to visualize the distribution of quality scores.
Comments: Clearly, most wines were given mid-range scores.
Examine correlations:
Reason for these plots: I wanted to visualize the relationships of the features to one another.
Comments: See below.
Alcohol, volatile.acidity, and sulphates seem to be the features most highly correlated with quality scores as they present correlation coefficient values furthest from zero, so let’s examine them further.
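A ranking like this could be produced along the following lines. This is only a sketch: the real data file is not loaded here, so `reds_demo` is synthetic stand-in data (ranges taken from the summary above, quality generated from made-up coefficients, and `pH` left as pure noise). The printed numbers are therefore illustrative and are not the report's results.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for the wine data (NOT the real data set).
reds_demo <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  sulphates        = runif(n, 0.33, 2.0),
  pH               = runif(n, 2.74, 4.01)   # unrelated to quality by construction
)

# Hypothetical generating rule, chosen only so the demo has some signal.
reds_demo$quality <- with(reds_demo, round(
  2.6 + 0.31 * alcohol - 1.22 * volatile.acidity + 0.68 * sulphates +
    rnorm(n, sd = 0.66)
))

# Correlation of every other feature with quality, ordered by distance from zero.
predictors <- setdiff(names(reds_demo), "quality")
cors   <- sapply(reds_demo[predictors], function(x) cor(x, reds_demo$quality))
ranked <- cors[order(abs(cors), decreasing = TRUE)]

print(round(ranked, 3))
```

Ordering by absolute value matters here because a strongly negative coefficient (as with volatile.acidity) is just as informative as a strongly positive one.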
The following graphs each display the distribution of wines by one of these three features (e.g., alcohol), and they also split the distributions by quality score to make any visible relationship between feature value (e.g., high alcohol) and quality score apparent.
Reason for these plots: I wanted to visualize the distribution of each of these three categories as they related to quality scores.
Comments: See below.
The plots above each seem to suggest that lower or higher values in the three categories explored (alcohol, volatile.acidity, and sulphates) each have a relationship with wine quality scores, particularly in the case of the lowest and highest scores.
Aside from an unexplained dip, mean alcohol content appears to rise roughly linearly with quality score. The quantile lines tell a similar story.
Conversely, as the volatile.acidity mean (and each quantile) increases, quality scores decrease.
The sulphate relationship with quality scores mirrors the one with alcohol.
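The per-score means described above could be tabulated roughly as follows. As the real data file is not available here, `reds_demo` is synthetic stand-in data with made-up coefficients, so the output only illustrates the shape of the calculation and not the report's actual values.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for the wine data (NOT the real data set).
reds_demo <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  sulphates        = runif(n, 0.33, 2.0)
)
reds_demo$quality <- with(reds_demo, round(
  2.6 + 0.31 * alcohol - 1.22 * volatile.acidity + 0.68 * sulphates +
    rnorm(n, sd = 0.66)
))

# Mean of each feature within each quality score.
means_by_quality <- aggregate(
  cbind(alcohol, volatile.acidity, sulphates) ~ quality,
  data = reds_demo, FUN = mean
)
print(round(means_by_quality, 3))

# Quartiles of alcohol within each quality score (the "quantile lines").
alcohol_quartiles <- tapply(
  reds_demo$alcohol, reds_demo$quality,
  quantile, probs = c(0.25, 0.5, 0.75)
)
print(alcohol_quartiles)
```

Scores with only a handful of wines will have noisy means, which is one plausible source of a dip in an otherwise steady trend.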
Reason for these plots: I wanted to visualize the relationship of the average values of each category to the quality scores.
Comments: See below.
Simply put: alcohol, volatile.acidity, and sulphates (particularly the first two) appear to have an effect on the quality scores. Alcohol will be discussed below, but, in general, the lower the volatile acidity, the higher the quality score; the inverse is true for sulphates and quality score.
I did not spend time looking at the other features because I’m focusing on answering the primary question driving this project.
Alcohol. Funnily enough, a higher alcohol content seems to go with a higher score.
As a continuation of the exploration of the three features above, the boxplots below show in clear visual terms that the typical value of each feature (e.g., alcohol) is nicely correlated with quality scores.
Reason for these plots: I wanted to visualize the relationship of the distribution and quantile values of each category to the quality scores.
Comments: There seems to be a definite relationship between each of these category values and the quality scores.
Reason for these plots: I wanted to visualize the three-way relationship of two feature values (with alcohol as a constant) and the quality scores.
Comments: As a result of the two faceted plots above, a general observation may be made: low volatile.acidity, high sulphates, and high alcohol content appear to be related to high quality scores.
Reason for these plots: I wanted to visualize the three-way relationship of two feature values (with alcohol as a constant) and the quality scores. In addition, I wanted to remove certain quality scores to attempt to visualize any clustering.
Comments: I see pretty solid visual evidence of clustering, further confirming the relationship between these three categories and quality scores.
Reason for this plot: Continuing the investigation of the above plots, I wanted to attempt to visualize all four types of values.
Comments: It’s difficult to see, but there does seem to be clustering with the addition of sulphates.
Reason for these plots: Continuing my investigation of clustering.
Comments: The contour lines help to confirm the clustering visually. Colouring the contour lines by quality score makes the clustering by quality score pretty obvious.
As my examination continued, I felt better and better about the apparent relationship between alcohol, volatile.acidity, and sulphates and quality scores.
I found it interesting that sulphate levels seem to have a sweet spot when it comes to quality scores.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = reds)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = reds)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = reds)
## m5: lm(formula = quality ~ alcohol:volatile.acidity:sulphates, data = reds)
## m6: lm(formula = quality ~ alcohol * volatile.acidity * sulphates,
## data = reds)
##
## ================================================================================================================
##                                                m1          m2          m3          m4          m5          m6
## ----------------------------------------------------------------------------------------------------------------
## (Intercept)                                 1.875***    3.095***    2.611***    2.611***    5.763***    1.285
##                                            (0.175)     (0.184)     (0.196)     (0.196)     (0.058)     (2.188)
## alcohol                                     0.361***    0.314***    0.309***    0.309***                0.426*
##                                            (0.017)     (0.016)     (0.016)     (0.016)                 (0.209)
## volatile.acidity                                       -1.384***   -1.221***   -1.221***                9.044*
##                                                        (0.095)     (0.097)     (0.097)                 (4.030)
## sulphates                                                           0.679***    0.679***                2.713
##                                                                    (0.101)     (0.101)                 (3.226)
## alcohol x volatile.acidity x sulphates                                                     -0.036*      1.524*
##                                                                                            (0.015)     (0.593)
## alcohol x volatile.acidity                                                                             -0.996*
##                                                                                                        (0.389)
## alcohol x sulphates                                                                                    -0.184
##                                                                                                        (0.309)
## volatile.acidity x sulphates                                                                          -15.622*
##                                                                                                        (6.130)
## ----------------------------------------------------------------------------------------------------------------
## R-squared                                   0.227       0.317       0.336       0.336       0.003       0.351
## adj. R-squared                              0.226       0.316       0.335       0.335       0.003       0.349
## sigma                                       0.710       0.668       0.659       0.659       0.806       0.652
## F                                         468.267     370.379     268.912     268.912       5.414     123.160
## p                                           0.000       0.000       0.000       0.000       0.020       0.000
## Log-likelihood                          -1721.057   -1621.814   -1599.384   -1599.384   -1923.929   -1580.453
## Deviance                                  805.870     711.796     692.105     692.105    1038.644     675.909
## AIC                                      3448.114    3251.628    3208.768    3208.768    3853.857    3178.905
## BIC                                      3464.245    3273.136    3235.654    3235.654    3869.988    3227.300
## N                                            1599        1599        1599        1599        1599        1599
## ================================================================================================================
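The AIC and BIC rows of the table can be reproduced from the log-likelihoods, which gives a second way of comparing the six models besides R^2. The sketch below uses only numbers read off the table; the variable names are introduced here for illustration, and the parameter count assumes the usual convention that `lm` counts the residual variance as one extra estimated parameter.

```r
# Log-likelihoods and numbers of regression coefficients (including the
# intercept), read off the model table above.
log_lik <- c(m1 = -1721.057, m2 = -1621.814, m3 = -1599.384,
             m4 = -1599.384, m5 = -1923.929, m6 = -1580.453)
n_coef  <- c(m1 = 2, m2 = 3, m3 = 4, m4 = 4, m5 = 2, m6 = 8)
n_obs   <- 1599

# One extra parameter for the estimated residual variance.
k <- n_coef + 1

aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + log(n_obs) * k

print(round(rbind(AIC = aic, BIC = bic), 3))

# Lower is better for both criteria.
best_by_aic <- names(which.min(aic))
best_by_bic <- names(which.min(bic))
print(c(AIC = best_by_aic, BIC = best_by_bic))
```

Because BIC penalises each extra coefficient by log(1599), about 7.4, rather than 2, it is the stricter test of whether the interaction terms in m6 earn their keep; m6 still comes out lowest on both criteria.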
Since model 6 had the highest R^2 value, I tested it with some obvious extreme cases (based on what seems to have been discovered above) using only values for alcohol, volatile.acidity, and sulphates:
## fit lwr upr
## 1 7.43747 6.1107 8.764241
## fit lwr upr
## 1 5.538351 4.259407 6.817296
## fit lwr upr
## 1 4.845372 3.271417 6.419327
As should be somewhat expected from the entire investigation so far, combined with the not-too-shabby R^2 value of the linear regression model we selected, these predictions landed where expected (fits of roughly 7.4, 5.5, and 4.8), although each interval spans more than two quality points.
Strength of this model: it works for the obvious cases. Weakness of this model: it’s unclear how robust it is.
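For reference, predictions with `fit`, `lwr`, and `upr` columns like those above are what `predict()` returns for an `lm` object when an interval is requested. The sketch below shows the mechanics only: `reds_demo` is synthetic stand-in data, the coefficients used to generate it are made up, the test wine is hypothetical, and the choice of `interval = "prediction"` is an assumption (the report does not say which interval type was used).

```r
set.seed(42)
n <- 500

# Synthetic stand-in for the wine data (NOT the real data set).
reds_demo <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  sulphates        = runif(n, 0.33, 2.0)
)
reds_demo$quality <- with(reds_demo, round(
  2.6 + 0.31 * alcohol - 1.22 * volatile.acidity + 0.68 * sulphates +
    rnorm(n, sd = 0.66)
))

# Same formula as m6: all main effects plus all interactions.
m6_demo <- lm(quality ~ alcohol * volatile.acidity * sulphates, data = reds_demo)

# A hypothetical "obviously good" wine: high alcohol, low volatile acidity.
new_wine <- data.frame(alcohol = 14, volatile.acidity = 0.2, sulphates = 0.9)

# Assumption: a prediction interval, which covers a single new wine's score.
pred <- predict(m6_demo, newdata = new_wine, interval = "prediction")
print(pred)
```

A prediction interval is always wider than a confidence interval for the mean, so the width of `lwr` to `upr` gives a rough sense of how far an individual wine's score could plausibly sit from the fitted value.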
This plot makes it easy to see the distribution of quality scores (most are in the middle), the rightward trend of alcohol content, the downward slope of volatile.acidity, and the mid-range sweet spot of sulphate levels, all in relation to quality scoring.
Although alcohol and quality are swapped from their perhaps expected axis locations, the swapping, along with the smoothing line, makes it clear that as quality increases, so does alcohol content (and, thus, the reverse relationship is true). Volatile.acidity and sulphates continue to play supporting roles.
Perhaps my favorite plot, borrowing from the experimentation with contours earlier, this plot, although leaving out sulphates, makes it clear that there are distinct clusters of quality scores that are quite obviously related to volatile.acidity and alcohol levels. If I were given a new red wine with only those two features listed, I would be very confident using merely this plot to predict the quality score (assuming the same wine experts responsible for this data set).
A fruitful exercise, this project exposed two or three features of red wines that, when related to one another, seem to lead to obvious groupings. Alcohol, volatile.acidity, and sulphates (in that order) appear to affect the (perceived) quality of red wines, at least among those wine experts consulted in the making of this data set.
Figuring out how to use R was the biggest difficulty for me with this project. Now that I have pretty interesting preliminary results, it would be good to attempt to collect a new sample of data to see if similar results could be found again.